Phrase-Based Document Categorization
Abstract
(Chapter in the Springer book "Current Challenges in Patent Information Retrieval", to appear in May 2011.) This paper takes a fresh look at an old idea in Information Retrieval: the use of linguistically extracted phrases as terms in the automatic categorization of documents, and in particular the pre-classification of patent applications. Until now, Information Retrieval research has found little or no evidence that document categorization benefits from the application of linguistic techniques. Classification algorithms using even the most cleverly designed linguistic representations typically did not perform better than those using the simple bag-of-words representation. We have investigated the use of dependency triples as terms in document categorization, according to a dependency model based on the notion of aboutness and using normalizing transformations to enhance recall. We describe a number of large-scale experiments with different document representations, test collections and even languages, presenting evidence that adding such triples to the words in a bag-of-terms document representation may lead to a statistically significant increase in the accuracy of document categorization.
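As a rough illustration of the idea of adding dependency triples to the words in a bag-of-terms representation, the following Python sketch merges word counts and triple counts into a single term bag. It is only a sketch under stated assumptions: the function names `triple_term` and `bag_of_terms`, the bracketed serialization of a triple, the relation labels, and the hand-written example triples are all hypothetical and are not taken from the chapter, which obtains its triples from a dependency parser with normalizing transformations.

```python
from collections import Counter


def triple_term(head, relation, modifier):
    # Hypothetical serialization: one string per triple so that it can be
    # counted like any other term. The format is an assumption.
    return f"[{head.lower()},{relation.upper()},{modifier.lower()}]"


def bag_of_terms(words, triples):
    # Start from the plain bag-of-words counts.
    terms = Counter(w.lower() for w in words)
    # Add each dependency triple as one extra term in the same bag.
    terms.update(triple_term(*t) for t in triples)
    return terms


# Illustrative input only: the triples are written by hand here, not parsed.
words = ["the", "device", "measures", "air", "pressure"]
triples = [
    ("device", "subj", "measures"),
    ("measures", "obj", "pressure"),
    ("pressure", "attr", "air"),
]
bag = bag_of_terms(words, triples)
```

The resulting `Counter` can be passed to any classifier that accepts sparse term counts. Keeping words and triples in one bag mirrors the "words plus triples" representation described in the abstract, while the triple extraction itself is left out.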
Similar resources
Loose Phrase String Kernels
When representing textual documents by feature vectors for the purposes of further processing (e.g. for categorization, clustering, or visualization), one possible representation is based on “loose phrases” (also known as “proximity features”). This is a generalization of n-grams: a loose phrase is considered to appear in a document if all the words from the phrase occur sufficiently close to e...
AuTopicGen: Rule based Positional Pattern Approach for Topic Collection in IR
IR systems consist of phases such as document preprocessing, indexing, query expansion, query matching, and ranking. The document preprocessing phase is the most important one: it parses the document and collects keywords. The relevance of the overall IR system improves if the main topics of a document are perfectly identified during this phase. It is a known fact that topics are mostly phrase based. Existing ph...
Learning Monolingual Compositional Representations via Bilingual Supervision
Bilingual models that capture the semantics of sentences are typically only evaluated on cross-lingual transfer tasks such as cross-lingual document categorization or machine translation. In this work, we evaluate the quality of the monolingual representations learned with a variant of the bilingual compositional model of Hermann and Blunsom (2014), when viewing translations in a second languag...
Using Noun Phrase Heads to Extract Document Keyphrases
Automatically extracting keyphrases from documents is a task with many applications in information retrieval and natural language processing. Document retrieval can be biased towards documents containing relevant keyphrases; documents can be classified or categorized based on their keyphrases; automatic text summarization may extract sentences with high keyphrase scores. This paper describes a ...
Text Mining
“Bag of words” model, acronym extraction, authorship ascription, coordinate matching, data mining, document clustering, document frequency, document retrieval, document similarity metrics, entity extraction, hidden Markov models, hubs and authorities, information extraction, information retrieval, key-phrase assignment, key-phrase extraction, knowledge engineering, language identification, link...
Publication date: 2011